 future prediction



Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes

Neural Information Processing Systems

Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, plausible future states, and use that to plan and anticipate the consequences of actions. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts that contain thousands of comparisons to directly impinge on this question. Specifically, we construct and evaluate several classes of sensory-cognitive networks to predict the future state of rich, ethologically-relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-slot objectives, to models that future predict in the latent space of purely static image-pretrained or dynamic video-pretrained foundation models. We find that "scale is not all you need", and that many state-of-the-art machine learning models fail to perform well on our neural and behavioral benchmarks for future prediction. In fact, only one class of models matches these data well overall. We find that neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models optimized for dynamic scenes in a self-supervised manner. These models also approach the neurons' ability to predict the environmental state variables that are visually hidden from view, despite not being explicitly trained to do so. Finally, we find that not all foundation model latents are equal. Notably, models that future predict in the latent space of video foundation models that are optimized to support a diverse range of egocentric sensorimotor tasks, reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation have strong inductive biases associated with them, and are thus far most consistent with being optimized to future predict on reusable visual representations that are useful for Embodied AI more generally.



Do Language Models Use Their Depth Efficiently?

Csordás, Róbert, Manning, Christopher D., Potts, Christopher

arXiv.org Artificial Intelligence

Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 families of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
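
The fourth analysis above (linear maps between the residual streams of a shallow and a deeper model) can be illustrated with a minimal sketch. This is not the authors' code: the activations are synthetic, the layer counts and widths are made up, and the helper names `fit_linear_map` and `best_matching_layers` are hypothetical. It only shows the general idea of fitting a least-squares map for every layer pair and asking which deep layer each shallow layer maps to best.

```python
import numpy as np

def fit_linear_map(src, dst):
    """Fit W minimizing ||src @ W - dst|| and return (W, relative error)."""
    W, *_ = np.linalg.lstsq(src, dst, rcond=None)
    err = np.linalg.norm(src @ W - dst) / np.linalg.norm(dst)
    return W, float(err)

def best_matching_layers(shallow, deep):
    """For each shallow layer, find the deep layer with the lowest fit error."""
    errs = np.array([[fit_linear_map(s, d)[1] for d in deep] for s in shallow])
    return errs.argmin(axis=1), errs

# Synthetic stand-in for residual-stream activations (assumption, not real data):
# 3 shallow layers of width 8, 6 deep layers of width 12, 200 token positions.
rng = np.random.default_rng(0)
n_tokens, d_shallow, d_deep = 200, 8, 12
shallow = [rng.normal(size=(n_tokens, d_shallow)) for _ in range(3)]
deep = [rng.normal(size=(n_tokens, d_deep)) for _ in range(6)]

# Plant a linear relationship at matching relative depth: shallow i -> deep 2*i + 1.
for i, s in enumerate(shallow):
    A = rng.normal(size=(d_shallow, d_deep))
    deep[2 * i + 1] = s @ A + 0.01 * rng.normal(size=(n_tokens, d_deep))

best, errs = best_matching_layers(shallow, deep)
print(best)  # indices of the best-matching deep layer per shallow layer
```

In a real analysis the maps would be evaluated on held-out tokens rather than on the data used to fit them, and the activations would come from the models themselves.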


DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation

Tian, Jingyi, Wang, Le, Zhou, Sanping, Wang, Sen, Li, Jiayi, Hua, Gang

arXiv.org Artificial Intelligence

Learning generalizable robotic manipulation policies remains a key challenge due to the scarcity of diverse real-world training data. While recent approaches have attempted to mitigate this through self-supervised representation learning, most either rely on 2D vision pretraining paradigms such as masked image modeling, which primarily focus on static semantics or scene geometry, or utilize large-scale video prediction models that emphasize 2D dynamics, thus failing to jointly learn the geometry, semantics, and dynamics required for effective manipulation. In this paper, we present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features via masked reconstruction and future prediction using differentiable volumetric rendering. By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation. The learned representations can be effectively transferred to downstream robotic manipulation tasks via action value map prediction. We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, as well as in real-world robotic experiments, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.


WildSmoke: Ready-to-Use Dynamic 3D Smoke Assets from a Single Video in the Wild

Liu, Yuqiu, Song, Jialin, Savva, Manolis, Chen, Wuyang

arXiv.org Artificial Intelligence

We propose a pipeline to extract and reconstruct dynamic 3D smoke assets from a single in-the-wild video, and further integrate interactive simulation for smoke design and editing. Recent developments in 3D vision have significantly improved reconstructing and rendering fluid dynamics, supporting realistic and temporally consistent view synthesis. However, current fluid reconstructions rely heavily on carefully controlled clean lab environments, whereas real-world videos captured in the wild are largely underexplored. We pinpoint three key challenges of reconstructing smoke in real-world videos and design targeted techniques, including smoke extraction with background removal, initialization of smoke particles and camera poses, and inferring multi-view videos. Our method not only outperforms previous reconstruction and generation methods with high-quality smoke reconstructions (+2.22 average PSNR on wild videos), but also enables diverse and realistic editing of fluid dynamics by simulating our smoke assets. We provide our models, data, and 4D smoke assets at [https://autumnyq.github.io/WildSmoke](https://autumnyq.github.io/WildSmoke).
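
For readers unfamiliar with the PSNR figure quoted above, here is a minimal sketch of the standard peak signal-to-noise ratio formula, 10 · log10(MAX² / MSE). The function name `psnr`, the `max_value` parameter, and the toy images are illustrative assumptions; the abstract does not state the exact evaluation protocol.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example: a black image versus one that is uniformly off by 0.1.
reference = np.zeros((4, 4, 3))
estimate = np.full((4, 4, 3), 0.1)
demo_psnr = psnr(reference, estimate)
print(demo_psnr)  # about 20 dB, since MSE = 0.01
```

Because PSNR is logarithmic in the mean squared error, a gain of 2.22 dB on a single frame corresponds to roughly 40% lower MSE on that frame.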




Data-Efficient Inference of Neural Fluid Fields via SciML Foundation Model

Liu, Yuqiu, Xu, Jingxuan, Soroco, Mauricio, Wei, Yunchao, Chen, Wuyang

arXiv.org Artificial Intelligence

Recent developments in 3D vision have enabled successful progress in inferring neural fluid fields and realistic rendering of fluid dynamics. However, these methods require real-world flow captures, which demand dense video sequences and specialized lab setups, making the process costly and challenging. Scientific machine learning (SciML) foundation models, which are pretrained on extensive simulations of partial differential equations (PDEs), encode rich multiphysics knowledge and thus provide promising sources of domain priors for inferring fluid fields. Nevertheless, their potential to advance real-world vision problems remains largely underexplored, raising questions about the transferability and practical utility of these foundation models. In this work, we demonstrate that a SciML foundation model can significantly improve the data efficiency of inferring real-world 3D fluid dynamics with improved generalization. At the core of our method is leveraging the strong forecasting capabilities and meaningful representations of SciML foundation models. We equip neural fluid fields with a novel collaborative training approach that utilizes augmented views and fluid features extracted by our foundation model. Our method demonstrates significant improvements in both quantitative metrics and visual quality, showcasing the practical applicability of SciML foundation models in real-world fluid dynamics.


Reviews: Flexible neural representation for physics prediction

Neural Information Processing Systems

The authors propose a novel hierarchical object representation based on particles to cover both rigid geometrical shapes and deformable materials. Each scene is represented as a graph, with disconnected components corresponding to the objects and the support of the scene. Each graph has a tree-like structure, where higher levels correspond to coarser scales, and the leaves correspond to the original particles placed in the object. They also propose an adapted neural network architecture, called Hierarchical Relation Network, that learns to predict physical dynamics for this representation. This multiscale approach is end-to-end differentiable, allowing this propagation mechanism to be learned.
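
The tree-like particle representation described above can be sketched roughly as follows. This is only an illustration of the data structure: the review does not say how particles are grouped, so the recursive median split along the widest axis, the centroid-based parent positions, and the names `Node`, `build_hierarchy`, and `leaves` are all assumptions, not the paper's method.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    position: tuple                               # centroid of all particles below this node
    children: list = field(default_factory=list)  # empty for leaf particles

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def build_hierarchy(points):
    """Recursively split particles along their widest axis into a binary tree."""
    if len(points) == 1:
        return Node(position=tuple(points[0]))
    dims = range(len(points[0]))
    axis = max(dims, key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    children = [build_hierarchy(ordered[:mid]), build_hierarchy(ordered[mid:])]
    return Node(position=centroid(points), children=children)

def leaves(node):
    """Collect the leaf nodes, which correspond to the original particles."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

# One object made of four particles; a scene would hold one such tree per object.
particles = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
root = build_hierarchy(particles)
print(root.position, len(leaves(root)))
```

Higher levels of the tree summarize progressively larger groups of particles, which is the sense in which they correspond to coarser scales.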